Stat 331: Final Project

Author

Alea Seifert, Robelyn Felices, Franchesca Garcia, and Libby Brill

Introduction

Data Description and Cleaning

For our final project, we chose to focus on how carbon dioxide (CO\(_{2}\)) emissions impacts life expectancy. We used data sets taken from Gapminder, which have been combined and collected from a variety of sources. The carbon dioxide emissions data set contains 194 observations (countries) with 224 variables (ranging in year from 1800 to 2022). These data represent the recorded consumption-based CO\(_{2}\) emissions, in tonnes of CO\(_{2}\) per capita. The life expectancy data set contains 195 observations (countries) with 302 variables (ranging in year from 1800 to 2100). The life expectancy data was collected from three main sources including 100 sources compiled by Mattias Lindgren from 1800-1970, the Global Burden of Disease Study from the Institute for Health Metrics and Evaluations containing data for 1970 to 2016, and World Population Prospects from the UN forecasts which published interpolated demographics from 2017-2099. These data represent the number of years a newborn infant would live assuming the mortality rate at their birth remains constant throughout their life.

Code
# Data
co2_emissions <- read_csv(here::here("Datasets", "co2_pcap_cons.csv"))
life_expectancy <- read_csv(here::here("Datasets", "lex.csv"))

# Pivoting Longer
co2_emissions_long <- co2_emissions |>
  mutate(across(`1800`:`2022`, ~ str_replace_all(.x, "−", "-"))) |>
  mutate(across(`1800`:`2022`, ~ as.numeric(.x))) |>
  pivot_longer(
    cols = `1800`:`2022`,
    names_to = "year",
    values_to = "co2"
  )

life_exp_long <- life_expectancy |>
  mutate(across(`1800`:`2100`, ~ as.numeric(.x))) |>
  pivot_longer(
    cols = `1800`:`2100`,
    names_to = "year",
    values_to = "life_exp"
  )

# Join Datasets
project_data <- co2_emissions_long |>
  inner_join(life_exp_long,
    by = c("country", "year")
  ) |>
  mutate(year = as.numeric(year))

We proceeded by pivoting both data sets into long format, then consolidating them into a singular data set containing variables for country, year, CO\(_{2}\) emissions, and life expectancy.

While pivoting the data, we received some warnings that resulted from R reading in the negative dashes as character values as opposed to numeric, specifically in our CO\(_{2}\) emissions data. To fix this issue, we replaced all dash symbols with the appropriate negative sign. This process was only applied for the CO\(_{2}\) emissions data. The year variable was also converted to numeric. Overall, we cleaned the data to create a usable data set to investigate how how carbon dioxide emissions impacts life expectancy.

Method

We analyzed the relationship between CO\(_{2}\) emissions and life expectancy, using CO\(_{2}\) emissions as the explanatory variable and life expectancy as the response variable. We expected a negative relationship between CO\(_{2}\) emissions and life expectancy, where countries that produce greater amounts of CO\(_{2}\) tend to have lower life expectancy. We assumed CO\(_{2}\) emissions decrease air quality and caused potential respiratory issues, decreasing life expectancy.

We assessed this relationship by using a Simple Linear Regression (SLR) to fit a linear model between our variables averaged across time. By using SLR, we assumed that there is a linear relationship between average CO\(_{2}\) emissions and average life expectancy per country, and we can use that to estimate a country’s average life expectancy based on their average CO\(_{2}\) emissions. This function is of the form: \(\widehat{response} = intercept + slope*explanatory\). That is, we can predict the average life expectancy of a country by multiplying their average CO\(_{2}\) emissions by the model slope and adding the model intercept.

It is important to note that not all assumptions necessary for SLR are not met with our data:

We can assume independence because we are observing average values of life expectancy per country. Each country should generally have its own state of conditions that determine life expectancy, however we should be weary of a possible lack of independence. If we did not average our data, meaning there are multiple observational rows per country, then independence would be violated.

We can assume normality because a histogram of average life expectancy per country is fairly bell shaped. The distribution is slightly right skewed, but our sample size of countries is large (\(n\) = 186) so should still be reliable.

We can assume equal variance because the points of a residuals versus fits based on a SLR model, although shift up and down, do not appear to fan in or out.

However, we cannot assume linearity because there is a slight curve seen in the visualization of the relationship between CO\(_{2}\) emissions and life expectancy in our data (Figure 2), as well as in the linear model’s residuals versus fits plot. Thus, we should proceed with caution when interpreting results produced by a linear regression model.

Results

Data Visualization

Our data immediately contrasted our hypothesis by showing a positive relationship between CO\(_{2}\) emissions and life expectancy, as opposed to a negative relationship like we expected. We investigated how CO\(_{2}\) emissions and life expectancy per country change over time:

Code
# Visualization 1:
animated <- project_data |>
  ggplot(aes(x = co2, 
             y = life_exp)) +
  geom_point(alpha = 0.7, show.legend = FALSE,
             na.rm = TRUE) +
  labs(title = "Average CO2 Emissions vs. Life Expectancy per Country in Year: {round(frame_time)}",
       x = "CO2 Emissions (tonnes per capita)",
       y = "",
       subtitle = "Life Expectancy (years)") +
  transition_time(year) +
  ease_aes("linear") +
  theme_bw()

animate(animated, renderer = gifski_renderer(), fps = 7)
<<<<<<< HEAD

Figure 1.
=======

Figure 1.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f

Figure 1 depicts the relationship between the explanatory variable, carbon dioxide emissions per capita in tonnes, and the response variable, life expectancy in years, for each country over years 1800 to 2022. It appears that with each year, both CO\(_{2}\) emissions and life expectancy increase. They likely both increase over time due to technological advancements. CO\(_{2}\) emissions have likely increased burning of fossil fuels from developments like cars. While the effects of CO\(_{2}\) emissions can create an impact on an individuals health, society also still progresses in fields such as health care and advancements in medicine. To further investigate, we analyzed the relationship between the CO\(_{2}\) emissions and life expectancy per country averaged over the years:

Code
# Averaged project data
country_data <- project_data |>
  group_by(country) |>
  summarize(avg_co2 = mean(co2),
            avg_life_exp = mean(life_exp)) 

# Visualization 2:
country_data |>
  ggplot(aes(x = avg_co2,
             y = avg_life_exp)) +
  geom_jitter(na.rm = TRUE) +
  geom_smooth(method = "lm", na.rm = TRUE) +
  theme_bw() +
  labs(title = expression("Average CO"*""[2]*" Emissions and Life Expectancy per Country"),
    x = expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),
    y = NULL,
    subtitle = "Average Life Expectancy (years)")
<<<<<<< HEAD

Figure 2.
=======

Figure 2.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f

Figure 2 depicts the average carbon dioxide emissions per capita in tonnes per country and the average life expectancy per country over the years 1800 to 2022. It appears that there is a positive, weak to moderate somewhat linear relationship between the two variables. There is an apparent increase in life expectancy as CO\(_{2}\) emissions increase with a curved pattern where there is a steep slope, but then a plateau.

This positive relationship can be explained because, as mentioned, over time there have been continuous technological developments that require increased burning of fossil fuels which emit CO\(_{2}\), and many allow access to improved healthcare and sanitation. The latter can contribute to the positive growth in the estimated life expectancy, even if CO\(_{2}\) emissions themselves would have a negative effect. It appears that there are potential confounding variables between these two variables.

Linear Regression Model

To further analyze our data, we fit a simple linear regression model to the data. We then analyzed whether the resulting model was a good predictor of our data:

Code
# Model:
project_lm <- lm(avg_life_exp ~ avg_co2,
                 data = country_data)

# Coefficients:
project_lm |>
  broom::tidy() |>
  mutate(p.value = scales::pvalue(p.value)) |>
  knitr::kable(digits = 3,
               caption = "Figure 3.",
               col.names = c("Term",
                             "Estimate",
                             "Standard Error", 
                             "Statistic",
                             "P-value"),
               align=c(rep('c',times=6)))
Figure 3.
Term Estimate Standard Error Statistic P-value
(Intercept) 41.149 0.404 101.908 <0.001
avg_co2 1.806 0.155 11.641 <0.001

Which provides the equation: \(\widehat{y} = 41.149 + 1.806x\)

Our predicted response variable (\(\widehat{y}\)) is the estimated average life expectancy per country, based on our explanatory variable (\(x\)), average CO\(_{2}\) emissions in tonnes per capita. The intercept in our simple linear model provided us with the information that when carbon dioxide emissions are zero, the estimated average life expectancy is 41.149 years. The slope term in our model is stating that for every additional one tonne per capita of CO\(_{2}\) emissions, the estimated average life expectancy goes up by 1.806 years, on average.

Model Fit

To assess the fit of our linear regression model, we calculated the variance in the response, fitted values, and residuals. Additionally, we analyzed the proportion of the variability in the response values was accounted for by the regression model:

Code
augment(project_lm) |>
  summarize(var_response = var(avg_life_exp),
            var_fit = var(.fitted),
            var_resid = var(.resid)) |>
  mutate(var_prop = var_fit/var_response) |>
  knitr::kable(digits = 3,
               caption = "Figure 4.",
               col.names = c("Variance in Response",
                             "Variance in Fitted",
                             "Variance in Residual", 
                             "Proportion"),
               align=c(rep('c',times=4)))
Figure 4.
Variance in Response Variance in Fitted Variance in Residual Proportion
33.054 14.018 19.036 0.424

The table above shows that the 14.018 represents the amount of variability in the average life expectancy that was accounted for by CO\(_{2}\) emissions. The value 19.036 is the amount of variability in life expectancy that has not been accounted for.

Figure 4 also shows 42.4% of variability in life expectancy is explained by our linear regression model using CO\(_{2}\) emissions as a predictor (\(14.018/33.054 = 0.424\)). This is a relatively moderate percentage of variability that can be explained by the model. This means the model is not the best representation of the variability in average life expectancy as it accounts for less than half of the variability.

Simulation

To further investigate whether the linear regression model would be an appropriate descriptor of the data, we simulated data to assess the quality of our linear model by comparing them to the observed data:

# observed plot
country_data |>
  ggplot(aes(x = avg_co2,
             y = avg_life_exp)) +
  geom_point(na.rm = TRUE) +
  theme_bw() +
  labs(title = "Observed Life Expectancy per Country",
       x = expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),
       y = NULL,
       subtitle = "Average Life Expectancy (years)") +
  ylim(25, 70)
# simulated plot
simulation <- as.vector(predict(project_lm) + rnorm(n = 186, mean = 0, sd = sigma(project_lm)))
sim_data <- country_data |>
  filter(!is.na(avg_life_exp)) |>
  mutate(sim = simulation)

sim_data |>
  ggplot(mapping = aes(x = avg_co2, y = sim)) +
  geom_point(na.rm = TRUE) +
  labs(
    title = "Simulated Life Expectancy per Country based on Linear Regression Model",
    x = expression("Average CO"*""[2]*" Emissions (tonnes per capita)"),
    y = "",
    subtitle = "Average Life Expectancy (years)"
  ) +
  theme_bw() +
  ylim(25, 70)
Code
<<<<<<< HEAD

Figure 5.
=======

Figure 5.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f
Code
<<<<<<< HEAD

Figure 5.
=======

Figure 5.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f

In comparing plots of the observed average life expectancy and simulated average life expectancy against CO\(_{2}\) emissions (Figure 5), their directions and strengths are similar, although the simulated values are noticeably more linear in form as they are based on the SLR model and the observed values follow a curved shape. The simulated data consequently also have lower minimum values and higher maximum values. This tells us that we should be hesitant to say that the linear model it a good fit.

We continued by analyzing \(R^{2}\) values to tell us how well the simulated data matched the observed data. It is calculated on a scale from 0 to 1, with \(R^{2}\) values closer to 1 signifying a stronger model:

Code
nsims <- 1000
sims <- map_dfc(
  .x = 1:nsims,
  .f = ~ tibble(sim = as.vector(predict(project_lm) + rnorm(
    n = 186,
    mean = 0,
    sd = sigma(project_lm)
  )))
)

colnames(sims) <- colnames(sims) |>
  str_replace(
    pattern = "\\.\\.\\.",
    replace = "_"
  )

sims <- project_data |>
  group_by(country) |>
  summarize(avg_life_exp = mean(life_exp)) |>
  filter(!is.na(avg_life_exp)) |>
  select(avg_life_exp) |>
  bind_cols(sims)

sims_r <- sims |>
  map(~ lm(avg_life_exp ~ .x,
    data = sims
  )) |>
  map(glance) |>
  map_dbl(~ .x$r.squared)

sims_r <- sims_r[names(sims_r) != "avg_life_exp"]

tibble(sims = sims_r) |>
  ggplot(aes(x = sims)) +
  geom_histogram(binwidth = 0.025) +
  labs(title = expression("Distribution of Simulated" ~ R^2),
    x = expression("Simulated" ~ R^2),
    y = "",
    subtitle = "Number of Simulated Models"
  ) +
  theme_bw()
<<<<<<< HEAD

Figure 6.
=======

Figure 6.
>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f
<<<<<<< HEAD

The \(R^{2}\) values for simulations based on our model histogram appears to have a normal distribution, with a possible right-skew. It appears to be centered around 0.1791 (Figure 6), which is relatively on the low-side. This indicates the data simulated under this statistical model are weakly similar to what was observed. On average, our simulated data accounts for about 17.91% of the variability in the observed life expectancy.

=======

The \(R^{2}\) values for simulations based on our model histogram appears to have a normal distribution, with a possible right-skew. It appears to be centered around 0.1798 (Figure 6), which is relatively on the low-side. This indicates the data simulated under this statistical model are weakly similar to what was observed. On average, our simulated data accounts for about 17.98% of the variability in the observed life expectancy.

>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f

Conclusion

The \(R^{2}\) value 42.4% of our SLR model (Figure 4) allows us to conclude that, accounting for not even half of the variability, it does not appear to be a particularly strong model of life expectancy.

<<<<<<< HEAD

Similarly, since the \(R^{2}\) value of the observed average life expediencies is relatively small, only about 0.1791, we can conclude that our simulated linear regression model does not match our observed data very well. Thus, it is further supported that our SLR model and does not have very strong predictive ability.

=======

Similarly, since the \(R^{2}\) value of the observed average life expediencies is relatively small, only about 0.1798, we can conclude that our simulated linear regression model does not match our observed data very well. Thus, it is further supported that our SLR model and does not have very strong predictive ability.

>>>>>>> a9400b7708eb60062c20fded6ac16e30e074cc0f

This is understandable due to the fact that our original linear model assumptions were not met - specifically the condition of linearity. Since our original observed data was not very linear, it follows that a linear model would not be the best fit.

In conclusion we have determined that there is likely not a linear relationship between average CO\(_2\) emissions per capita and average life expectancy. However, there may be a different model that would better represent a relationship.